Introduction

When shopping for a bottle of wine, it can be difficult to choose due to all of the available options. These datasets about wine could offer some insight to make those decisions easier. I have a preference for red wine. So that’s where I’ll begin.

The datasets contain a sampling of wine chemical composition and a quality rating. Quality is the median of at least three expert evaluations. Compositional qualities are provided in a variety of measurements across 10 variables. In total, there are 6,497 observations: 1,599 reds and 4,898 whites.

These datasets should support a good exploration of the relationship between wine quality and chemical composition.

The datasets explored in this analysis were found here for red and here for white. The accompanying documentation of the datasets can be found here.

Citations

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

[@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

Initial Investigation

First, a quick look at the data structure.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

There’s not much to go on with just raw numbers. We’ll need to investigate the accompanying documentation to give context to these numbers. From the documentation we find the data set contains compositional measurements of various solutes found within a solution of wine. Also included is a measure of quality saved as an integer.

This investigation is primarily concerned with predicting wine quality from wine composition. It only makes sense to start with an investigation of the quality score. How utilized is the 0-10 quality scale?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

With a minimum of three and a maximum of eight, it’s fair to say the range is underutilized. With a median near the mean, there’s no reason to suspect skew. What does give me pause is that the 1st and 3rd quartiles are separated by only 1. This suggests an inordinate number of wines in the middling quality range. Histograms are cheap and will give a quick visualization to confirm.

It appears the dataset does primarily consist of wines of average quality. A pie chart would provide an understanding of how homogeneous the population quality is.

Wines of average quality definitely account for a large portion of our sample. Without performing a calculation, I estimate about 80% of the red wine sample is of average quality.
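The share of average wines can be counted rather than estimated by eye. Below is a minimal Python sketch (the original analysis appears to have been done in R; Python is used here only for illustration). The function name `share_average` is hypothetical, and the input is just the ten quality scores visible in the structure printout above, not the full sample.

```python
from collections import Counter

def share_average(qualities, average_scores=(5, 6)):
    """Return the fraction of wines whose quality is one of the 'average' scores."""
    counts = Counter(qualities)
    n_average = sum(counts[score] for score in average_scores)
    return n_average / len(qualities)

# Only the first ten red-wine quality scores shown in the structure printout.
# The full 1,599-row sample would be read from the dataset file instead.
first_ten_red = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
print(f"{share_average(first_ten_red):.0%} of these ten wines are rated 5 or 6")
```

Running the same function on the complete quality column would replace the visual estimate with an exact percentage.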

It might be valuable to include the sample of white wines to increase the sample size.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The datasets share a source. It makes sense they would have compatible variables. However, the white wine dataset is about three times the size of the red wine dataset (4,898 observations versus 1,599).

Remember, the primary purpose of adding the white wine dataset to the investigation is to increase the granularity of quality ratings. There is the possibility that including white wine will skew our data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Not much is different. The median and the 1st and 3rd quartiles are unchanged, and the mean has risen only slightly. On a successful note, the quality range is better utilized. We now have a population of wines rated as nine.

Perhaps the distribution of quality for white wines has greater utilization.

That’s disappointing. There is still a disproportionate number of average quality wines in the white sample.

That disproportion is more apparent in a pie chart. Still, a smaller percentage of white wines are considered average quality. It seems approximately 75% of the sample is rated at five or six, compared to 80% of the reds. So there was some success in our attempt at gaining granularity.

What Next

The sample may not be large enough to predict whether a wine is of exceptionally good or poor quality based on its composition alone. There do seem to be enough samples to determine the chemical make-up of an average quality wine.

Perhaps it will be enough to predict if a wine is of average quality. If it isn’t, then it’s either very bad or very good. With this thinking, analysis of average vs. not-average may provide the best insight this dataset can offer. We might be able to say what makes a wine average.

Delving into the Dataset

Some rudimentary research on wine turned up Wine Chemistry on Wikipedia. From this we see there’s a possibility of some missing measures, for example phenolic compounds and proteins. The possibility of missing solute data is a cause for concern. However, the accompanying documentation makes the claim that no compositional information is missing. We’ll proceed with the information we do have.

The datasets contain thirteen variables for each observation. Ten of those variables describe composition. Eight of the ten are measures of concentration. Alcohol content is provided as a percent. Density provides a measure of the entire solution. Some of the solute measures share the same units; others use different units. These should be converted to equal units of measure. With equal units, we can make better compositional comparisons.

The remaining variables are not compositional measurements. X is an index that is not linked to another dataset. It can be discarded. Quality is provided as an integer. There may be cause to convert it to a factor. Converting quality to a factor would reduce the scale. This may complicate any future additions to the dataset. Finally there’s pH, which is a descriptor of acidity. It may be of interest to explore the relationship between total acidic composition and pH.

Points of Interest

Of most interest are the measures of concentration. The mass per volume of each solute is assumed to be correlated with the wine’s quality. Comparing the proportions of a wine’s composition against its quality is the most interesting feature to investigate.

Another interest is the non-compositional value of pH. There are a few questions we can try to answer here. What commonality between pH values exists at varied concentrations of acidic solute? Is there a relationship between pH and quality?

With both red and white datasets, we can also investigate differences between the two types. How do their compositions differ? Do those compositional differences affect the qualitative scale between whites and reds?

Data Preparation

As previously noted, we will need to convert our compositional data to units of equal measure. First we should consider what options are available and which best fits our purpose.

The first option is mass per volume. The data is presented to us in this format. This could be a good measure for the investigation. Grams per cubic decimeter may be a common measure in the drug markets. I’d like a unit of measure that is more widely understood.

Parts per million (ppm) is a commonly reported unit of concentration. This would make for easily reported values. What effect would a scale in the millions have on visualizations?

A solute’s percentage of a solution should cover these concerns. A summation of solute percent can be used to determine the ratio of solvent in each solution of wine. Percents should provide a more manageable scale, and are commonly understood.

If any of these assumptions about percents proves false, it will be easy to convert back to another, better suited measure. 1% is equal to 10,000 ppm, and 1 ppm is equal to 1 mg/L (for a solution with a density close to 1 kg/L, which holds for wine). We can also easily convert to grams per serving. There are 33.8 fluid ounces in a liter and 5 fluid ounces in a serving.

Having considered the options, we can generate a working dataset. The following list should cover what has been decided on.

  1. Add a field to note the wine’s color.
  2. Merge the two datasets into one.
  3. Add a logical field to note if the wine is ordinary. (i.e. quality is 5 or 6)
  4. Convert density from grams per cubic centimeter to grams per cubic decimeter. (i.e. multiply by 1000)
  5. Convert Sulfur dioxides from milligrams per cubic decimeter to grams per cubic decimeter. (i.e. divide by 1000)
  6. Convert solute concentration values to percents. (i.e. divide their value by density and multiply by 100)
  7. Add fields for total solute and solvent percents.
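The steps above can be sketched in a few lines. This is a Python illustration under stated assumptions, not the code used in the analysis (which appears to have been done in R). The function name `prepare` is hypothetical. Two choices are assumptions of this sketch: free sulfur dioxide is left out of the solute total because it is already part of total sulfur dioxide, and alcohol (a percent by volume) is added to the mass percents as-is when computing the totals. The two sample rows are the first red and first white observations from the structure printouts, with density as displayed there.

```python
# Columns reported in g/dm^3 and mg/dm^3 respectively, per the unit steps above.
G_PER_L_COLUMNS = ["fixed.acidity", "volatile.acidity", "citric.acid",
                   "residual.sugar", "chlorides", "sulphates"]
MG_PER_L_COLUMNS = ["free.sulfur.dioxide", "total.sulfur.dioxide"]

def prepare(row, color):
    """Apply preparation steps 1 and 3-7 to one observation (a dict of raw values)."""
    out = {"color": color, "quality": row["quality"],
           "pH": row["pH"], "alcohol": row["alcohol"]}
    out["ordinary"] = row["quality"] in (5, 6)           # step 3
    density = row["density"] * 1000.0                    # step 4: g/cm^3 -> g/dm^3
    out["density"] = density
    for name in G_PER_L_COLUMNS:                         # step 6
        out[name] = row[name] / density * 100.0
    for name in MG_PER_L_COLUMNS:                        # steps 5 and 6
        out[name] = (row[name] / 1000.0) / density * 100.0
    # Step 7. ASSUMPTION: free SO2 is excluded (subset of total SO2) and
    # alcohol is counted as a solute, added to the mass percents unchanged.
    solute_names = G_PER_L_COLUMNS + ["total.sulfur.dioxide"]
    out["total.solute"] = sum(out[n] for n in solute_names) + out["alcohol"]
    out["total.solvent"] = 100.0 - out["total.solute"]
    return out

first_red = {"fixed.acidity": 7.4, "volatile.acidity": 0.7, "citric.acid": 0.0,
             "residual.sugar": 1.9, "chlorides": 0.076,
             "free.sulfur.dioxide": 11.0, "total.sulfur.dioxide": 34.0,
             "density": 0.998, "pH": 3.51, "sulphates": 0.56,
             "alcohol": 9.4, "quality": 5}
first_white = {"fixed.acidity": 7.0, "volatile.acidity": 0.27, "citric.acid": 0.36,
               "residual.sugar": 20.7, "chlorides": 0.045,
               "free.sulfur.dioxide": 45.0, "total.sulfur.dioxide": 170.0,
               "density": 1.001, "pH": 3.0, "sulphates": 0.45,
               "alcohol": 8.8, "quality": 6}

# Step 2: tagging each row with its color and collecting into one list merges the sets.
wines = [prepare(first_red, "red"), prepare(first_white, "white")]
for wine in wines:
    print(wine["color"], round(wine["chlorides"], 4), round(wine["total.solvent"], 2))
```

Keeping the color tag on every row is what makes the later red versus white comparisons possible after the merge.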

Data Preparation Aftermath

Having cleaned up the data, I’d like to take note of the variables added. It is good to keep track of such things. Just in case we forget why they were added.

The sum of the percent solutes does not account for the total solution. With no information on the composition of the remaining solution, and the claim there are no missing attributes, we must assume the unaccounted solution is a tasteless solvent (e.g. water). We’re also making the assumption that alcohol is not a solvent in this case. Higher solvent content may be a factor in determining quality. We could hypothesize that watered-down wine is of lower quality. These values will allow us to test such hypotheses.

Another variable we added retains information on which dataset the observation originated from. This will allow us to compare the differences between red and white wines. As someone with a preference for reds, I think this part is the most important.

The exceptional vs ordinary variable is intended for exploring what is common in wines of average quality. In this case, exceptional is defined as not average. Exceptionally bad and exceptionally good wines have similar numbers of observations. Comparing what makes for average vs. non-average wine and then comparing both ends of the extremes may give a more complete picture of wine quality.

Finally, the solute columns were changed from mass per volume to percents. Density, by definition, is 100% of the solution, and this is reflected in our calculation. We should also not forget the unit conversions we have. While we are working with percents, it helps to be able to report whichever unit is most convenient.

1% is equal to:

  • 10,000 ppm
  • 10 g/L
  • 1.48 g per serving
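These equivalences follow from simple arithmetic, sketched below in Python for illustration (the function names are hypothetical). It uses the same approximation as the text: one liter of wine weighs about one kilogram, so 1% by mass is about 10 g/L. The serving size comes from 5 ounces out of the 33.8 ounces in a liter.

```python
OUNCES_PER_LITER = 33.8
OUNCES_PER_SERVING = 5.0

def percent_to_ppm(percent):
    """1% is 1 part in 100, which is 10,000 parts in 1,000,000."""
    return percent * 10_000.0

def percent_to_g_per_l(percent):
    """Assumes roughly 1,000 g of solution per liter, so 1% is about 10 g/L."""
    return percent * 10.0

def percent_to_g_per_serving(percent):
    """Scale grams per liter down to a 5-ounce serving."""
    liters_per_serving = OUNCES_PER_SERVING / OUNCES_PER_LITER
    return percent_to_g_per_l(percent) * liters_per_serving

print(percent_to_ppm(1.0))                      # parts per million in 1%
print(percent_to_g_per_l(1.0))                  # grams per liter in 1%
print(round(percent_to_g_per_serving(1.0), 2))  # grams per serving in 1%
```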

Deeper Investigation

When cleaning up the dataset, a number of questions were raised on the relationships between variables. A matrix synopsis between the solutes should give a good direction on where to go.

Universal Comparisons

From this initial comparison, it seems alcohol content is the best predictor of wine quality. It also suggests that comparing average wines against all others may not provide the results hoped for. Based on the distribution of alcohol content by quality, it makes more sense to break quality into lesser and better groups. This would regroup the observations with a split between qualities 5 and 6.
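The regrouping described here amounts to a single threshold on the quality score. A minimal Python sketch follows (Python is used for illustration only; the function name `quality_group` and the labels are hypothetical). The sample scores are the ten red-wine quality values visible in the first structure printout.

```python
from collections import Counter

def quality_group(quality):
    """Label a score 'lesser' (5 and below) or 'better' (6 and above)."""
    return "lesser" if quality <= 5 else "better"

# The ten red-wine quality scores visible in the structure printout.
sample = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
print(Counter(quality_group(q) for q in sample))
```

Unlike the ordinary versus exceptional flag, this split keeps the very good and very bad wines on opposite sides of the line.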

The strongest solute correlation is between free sulfur dioxide and total sulfur dioxide. This makes sense as one is a subset of the other. Sulfur dioxide content related to quality does show an interesting phenomenon. It appears that lower quality wines have a larger range of sulfur dioxide content.

Acidic content also appears to have a slight impact on wine quality. The range between the upper and lower bounds of acidic solutes narrows as quality improves. Total acidic content could be investigated further.

Total solvent and color are missing from the comparisons. They should be added in the next comparison set. Variable ordering should also be adjusted in the next comparison set.

Refined Comparisons

This new set has a higher number of correlated values. As alcohol content increases, total solvent decreases. This suggests that alcohol is a greater portion of the solution than the other solutes. The opposite is true for total acid and fixed acidity. As one increases, so does the other. This suggests that fixed acidity makes up a larger share of total acidity than the other acidic forms.

Residual sugar appears to be consistent through all observations. This is surprising, since sugar is the fuel that yeast converts into alcohol.

Despite the low correlation between pH and total acidity, there does appear to be a loose relationship. As total acidity increases, the pH drops (becomes more acidic). There also appears to be a slight relationship between acidity and quality. Fixed acidity, citric acid and pH seem to be consistent through all quality grades. Volatile acidity seems to be higher in lesser quality wines.

We start to see hints at the variation in chemical makeup between red and white wines. White wines tend to have more residual sugar, more sulfur dioxides, and lower pH. The lower pH of white wines contrasts with the lower acidic content of white wine. Remember, lower pH means more acidic. The expectation is higher acidic content would have lower pH. One possible explanation is fixed vs. other forms of acid.

I did note earlier that the range of total sulfur dioxide is greater at lower qualities. However, the mean of sulfur dioxide content does remain constant. The difference in range could be attributed to white wine’s tendency to have a higher concentration of sulfur dioxides than red wines. There’s some logic to save these values for a future analysis. One with greater focus on the compositional differences of red and white wines.

The effect of acidic compounds and pH on quality also appears to be dependent on wine variety. With all of these differences between red and white, it would be good to compare red and white composition separately.

Comparisons of Varietals

Red

White

When splitting out the comparisons by color, it’s easier to see how differences in composition affect quality. Acidic and sulfuric compounds are a greater predictor of wine quality in reds than whites. Across both types, alcohol remains the greatest predictor of wine quality. Acidic and sulfuric compounds will be reserved for future analysis.

Is there something that indirectly influences quality? A relationship between alcohol content and the remaining compounds? What are the ratios of other compounds compared to alcohol content?

Distribution of chlorides (salt) across alcohol seems to match the curve of alcohol content for all types. Less alcoholic wine tends to have more salt. Aside from some outliers, the percent salt content in red and whites appears to be similar.

Despite the low correlation, residual sugar and alcohol content also appear to have a universal trend across the varietals. We can only speculate on the unknown factors that would produce such a result. One factor could be the starting sugar. Another might be the yeast used to convert sugar to alcohol. At what alcohol content does that yeast die? Was the fermentation process interrupted at a certain point? These are questions we don’t have the answers to. We shouldn’t ignore this apparent relationship because of unknown influences.

From all of this, the plots between alcohol, salt and sugar should be the center of our exploration. There appears to be some type of relationship between alcohol and quality. There also appears to be a relationship between the ratios of salt and sugar to alcohol. Before moving forward, it’s only prudent to take some steps back.

Alcohol and Quality

Distribution of the alcohol content in our samples is right skewed. Are the samples of higher content skewing the quality contents? So far, we’ve been directing our thinking on alcohol content and wine quality by the tendency of higher average alcohol content at higher levels of quality. We should take a closer look at the data to support this assumption.

There’s quite a bit of overplotting in that graph. Still, there is some pattern in the extreme qualities.

Wine samples graded as a 9 can be counted on one hand. There’s still a bit of overplotting. It is easier to see the clusters of samples rise as quality improves.

Reducing the alpha a bit makes the change in the clusters clearer. It’s also clear we can’t say that wine quality will be better just because the alcohol content is higher. This is different from saying higher quality wines tend to have higher alcohol content. Do not confuse this with saying lower quality wines will always have lower alcohol content.

Taking the bins we’ve defined for lesser and better quality shows a clearer picture. Better quality wines seem evenly dispersed across the range of alcohol content. It looks like the cutoff is just above 10% alcohol content. Wines with more than 10% alcohol content are more likely to be of better quality.

Considering we would rather increase than give up resolution of quality, it would be good to take a look at the upper and lower quality bins from another perspective. To facilitate this, let’s recreate the smaller bins without the added field.

We ended up dropping wines of quality 9. As there are so few observations with that rating, they could be considered outliers. Basically this is the same graph as previously created. The benefit of this method is that it allows us to maintain quality granularity within wines of better or lesser quality.

I’m also having some regret about the decision to transform quality into a factor. That does give more control over the scale. That additional control is underutilized when quality is on an integer scale.

Taking a closer look at the breakdown of the better quality wines, we see that wines of quality 6 run the full spectrum of alcohol content. Disregarding wines of quality 6, better wines start to occur more frequently at alcohol content near 10-11%. Wines of the highest quality have even fewer occurrences with alcohol content under ~9.5%.

The shift for wines of greater vs lesser quality around 10% is more apparent by limiting the view to the middle quality scores. Wines of quality 6 appear to be a merge between wines of quality 5 and 7.

Breaking out the lowest quality wines shows a dispersion that rarely breaks 11.5% alcohol. If the lowest quality wines rarely rise above 11.5% alcohol and the highest quality wines rarely have less than 9.5% alcohol, then the average quality wines must fall within that range.

What would happen to our graphs if we eliminated the ordinary wines defined as having alcohol content between 9.5% and 11.5%? This also revisits the idea of average versus non-average wines. We have a definition of what makes for an ordinary wine. Eliminating those from our views, it should become easier to see the differences at the extremes.

The cutoff between better and lesser quality wines is even easier to see when we remove the middle range of alcohol content. The split appears to fall somewhere within the wines rated 6.

It’s now clear that quality 6 wines are of ambiguous quality. We should look back at the lesser vs better quality graph while eliminating the ambiguous value.

There is now a quite clear relationship between quality and alcohol content. We should remember the quality scale is subjective and based on the opinion of experts. It’s possible those experts have a preference for higher alcohol content in their wines. On the observations made thus far, I would suggest that preference is likely.

As part of due diligence, we should look at the actual correlation value between quality and alcohol.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$alcohol and as.numeric(wines$quality)
## t = 39.97, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4245892 0.4636261
## sample estimates:
##       cor 
## 0.4443185

Across all samples, there is a moderate correlation between quality and alcohol content. This correlation includes the ambiguous quality rating. This is a central value that fits the model but skews the data. This skew appears to be because of a lack of resolution in the quality scale. What happens to correlation if we omit this ambiguous score?

## 
##  Pearson's product-moment correlation
## 
## data:  wines.noMiddle$alcohol and as.numeric(wines.noMiddle$quality)
## t = 41.294, df = 3659, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5413057 0.5855142
## sample estimates:
##       cor 
## 0.5638137

By omitting the ambiguous quality value on our scale, the correlation strengthens from about 0.44 to about 0.56. I would hypothesize that, if we had a more granular and less subjective quality scale, the correlation between the two would be even stronger. For now, I will state that alcohol is an acceptable measure of quality. It lends greater granularity than the current quality scale.
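The two test outputs can be cross-checked by hand. For a Pearson correlation r with df degrees of freedom, the test statistic is t = r * sqrt(df / (1 - r^2)). The Python sketch below (for illustration only; the function names are hypothetical) implements the correlation coefficient and that formula, then feeds in the r and df values reported above. The results should land on the reported t values up to rounding.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, df):
    """t statistic for testing whether a correlation differs from zero."""
    return r * math.sqrt(df / (1.0 - r * r))

# r and df copied from the two outputs above.
print(round(t_statistic(0.4443185, 6495), 2))  # all wines
print(round(t_statistic(0.5638137, 3659), 2))  # quality 6 omitted
```

Note that the second test has fewer degrees of freedom yet a larger t, which is what a stronger correlation on a smaller sample looks like.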

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.30   10.49   11.30   14.90

The earlier observations on the upper limit of alcohol content for lower quality wines and lower limit for higher quality wines fall almost exactly on the 1st and 3rd quartiles of the alcohol content range. This gives even more confidence in the assertion that alcohol is the measure of quality.

Intermediate Analysis

The quality score provided by the experts lacks in granularity. Luckily, the experts tend to rate wines of higher alcohol content more favorably. By equating alcohol content with quality, we have better granularity for determining the composition of higher quality wines.

On the compositional side of things, salt and sugar seem to have the most consistent relationship to alcohol across both varietals. There are compounds, such as citric acid, with stronger relationships to alcohol content. Those relations do not appear to be nearly as consistent across varieties.

Moving forward in our exploration, we’ll focus on alcohol, salt, sugar, and wine variety. The other interesting variables will be saved for future analysis.

Deeper Look

To complete our investigation, we’ll start by investigating how salt concentration compares at various alcohol levels.

Salt

The majority of wines have less than a hundredth of a percent of salt. There’s a jump in salt content around 9.5 percent alcohol. Most of the high values appear to be outliers. Some scaling and subsetting should produce a better picture.

There’s a small but consistent drop in salt content, from ~0.0047% to ~0.0035%, as alcohol content increases from ~9% to ~13%. These values extend just outside the ~9.5% to ~11.5% range we previously determined as average quality. As wine quality increases, salt content decreases.

What about sugar content?

Sugar

Again, the data is condensed on itself. Some adjustment on perspective can only help.

There’s a large amount of residual sugars in wines of low alcohol content. This seems logical. Sugar is what yeast eats to make alcohol. Less sugar consumed means more left over sugar and less alcohol.

There appears to be an interesting bump in sugar content near the delineation of average and better than average wines. Of all the possibilities that could explain that bump, the one we have is variety.

That doesn’t answer anything about the bump. If anything, the bump is more pronounced. The interesting thing it did illustrate is the difference between red and white wines. It seems there are two graphs of sugar content stacked on each other.

The lower graph is composed of the red wines, which show a consistent level of residual sugar content at all levels of alcohol. In the upper graph of white wines, sugar drops sharply as alcohol increases before leveling out at the highest levels.

Salt Revisited

Is there a similar phenomenon in salt content?

White wines have less salt than reds. White wines also have a steeper drop in salt content as alcohol content increases. It is a similar phenomenon to sugar, except that here it is the red wines that sit on the upper graph.

Sugar in Reds

Returning to sugar content, it might be helpful to break reds and whites out to their individual scales.

Unlike white wines, red wines appear to have a slight increase in sugar content as alcohol increases. It’s very slight, starting around 0.225% and ending near 0.25%. Given the low sample size and the lack of information on measurement error, the best we can say is red wine has consistent sugar content across all values. There may be some minuscule differences, but there’s not enough information to say sugar content is a contributing factor in red wine quality.

To be sure, we should check against the original quality scale.

The majority of values reside under 0.4% sugar content. Adjusting both graphs might yield a better image. Before making this adjustment, it’s worth noting the 1st quartile and mean are fairly consistent across the quality range. There is some variation in the 3rd quartile, but no consistent pattern.

Removing these outliers produced a more consistent pattern in both graphs. The sugar content for the range of alcohol content is more flat. Sugar content at each quality score has a consistent mean with variation occurring in the 1st and 3rd quartiles.

As an aside, the similarities in these graphs reinforce our earlier assertion that alcohol content is a good predictor of wine quality. Unfortunately, these findings are contrary to our earlier observation that sugar content has a consistent relationship to alcohol across varietals.

Putting it All Together

One final graph I want to develop is alcohol content at ratios of salts and sugar.

What we see is a concentration of higher alcohol content at lower salt and sugar content. Some of that could be attributed to the intersection of red and white wines.

Red wine tends to have a much lower sugar content and higher salt content than white wine. That could explain the lower leg of this graph.

Another thing to consider is the range of granularity alcohol content provides. This spread of alcohol content across the ratios of sugar to salt could become more apparent by introducing more colors to our graph.

At three colors, it seems more clear that lower alcohol content wines tend to occur at the extreme values of sugar and salt content.

At four colors, diminishing returns start to kick in. It does help solidify what we’ve already seen. There’s slightly more resolution on the ratios of sugar and salt where the highest alcohol content wines occur. The highest concentration of red dots appears near 0.0025% salt and 0.25% sugar.

It’s worth noting that what has been referred to thus far as a ratio between two compounds is a misnomer. A true ratio would describe, for example, 100 grams of sugar for every gram of salt. We do see a higher concentration of the higher alcohol content wines around 0.3% sugar and 0.003% salt. We do not see any examples at 2% sugar and 0.02% salt, even though the ratio is the same. There is no diagonal line that would describe a perfect ratio.

To be more accurate, we should say that the highest alcohol content wines occur between 0.003% and 0.004% salt content and 0.1 to 0.75% sugar content. Looking closer at this range could provide additional value.
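The difference between a ratio and a range can be made concrete. In the Python sketch below (for illustration only; the function and constant names are hypothetical), the two sugar and salt pairs mentioned above share the same 100 to 1 ratio, yet only one of them falls inside the salt and sugar ranges just described.

```python
# Ranges taken from the paragraph above, in percent of solution.
SALT_RANGE = (0.003, 0.004)
SUGAR_RANGE = (0.1, 0.75)

def sugar_to_salt_ratio(sugar_pct, salt_pct):
    """Grams of sugar per gram of salt."""
    return sugar_pct / salt_pct

def in_range(sugar_pct, salt_pct):
    """True when both values fall inside the ranges where high-alcohol wines cluster."""
    salt_ok = SALT_RANGE[0] <= salt_pct <= SALT_RANGE[1]
    sugar_ok = SUGAR_RANGE[0] <= sugar_pct <= SUGAR_RANGE[1]
    return salt_ok and sugar_ok

for sugar, salt in [(0.3, 0.003), (2.0, 0.02)]:
    ratio = sugar_to_salt_ratio(sugar, salt)
    print(f"sugar={sugar}% salt={salt}% ratio={ratio:.0f} in_range={in_range(sugar, salt)}")
```

A ratio test alone would treat both pairs as equivalent, which is why a two-sided range on each solute is the more accurate description.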

The Happy Spot

Zooming in on this range we can see that it isn’t a true ratio. It’s more of a happy spot where the balance between salt and sugar isn’t too much in one direction or the other. There’s a bit of noise at this scale with 3 colors.

Gradation of one color from black to white provides a good contrast. We can better see where there’s not enough sugar for the amount of salt in that bright blue. Just above the bright blue, we see a distribution of higher alcohol content wines extending into higher sugar content.

Zooming back out with this color scale, the happy spot is more pronounced. The trade-off is a loss of definition in the extreme values.

Red vs. Whites

One thing that we’ve touched on previously is the composition of the varietals and how they differ even where there are similarities.

Red wine’s contribution to the happy spot is clear and not unexpected. Red wines have a consistent sugar content and typically have more salt. As such they make up the lower leg of the salt to sugar distribution.

White wine’s contribution to the happy spot is also quite clear, although white wine seems to be more forgiving on sugar content. Instead of saying white wines are more accepting of high sugar content, we should say that higher levels of salt need lower sugar content to remain at high quality.

Multivariate Analysis

Based on what we’ve seen, salt content has more impact on wine quality than sugar. Even with sugar rich white wines, salt content is the largest factor for determining quality. In white wines, you can have more sugar in a higher quality wine as long as the salt content remains in a minimal range.

Between red and white wines, there is a convergence where wines of the highest alcohol content occur with a balance of sugar and salt. This sweet spot provides some answers to the question of how wine composition relates to quality. What is not clear is how alcohol content and the balance of salt and sugar relate to quality. Are the ratios of salt and sugar the ultimate determinant of quality, or does the alcohol content also contribute?

Alcohol content as a granular measure of quality is something that shouldn’t escape scrutiny. In this analysis, alcohol content was taken as a surrogate for quality. It offered a level of granularity not present in the quality scale. One issue with this is it may hide the relationship of other solutes in the solution and how the combination of those solutes relate to quality.

Final Plots and Summary

Quality

Using Alcohol as Quality

The dataset has an issue of granularity within the quality scale. Without a greater level of granularity, the delineation between good and bad quality wines is lost. Alcohol content was identified as a surrogate for quality. It offers a greater level of granularity.

This granularity and the logic of alcohol content as quality are illustrated here. There are too few observations of exceptionally good and exceptionally bad wines to confirm this assumption consistently. However, for the observations we do have, we see the mean alcohol content of lesser quality wines is much lower than that of higher quality wines.

Salt Content of Varietals

Quality of Salt in Wine

Using alcohol as a granular measurement of quality revealed salt content as the major factor in determining wine quality. Across both varietals, as alcohol content increases, salt content decreases. The better wines have less salt.

The amount of salt in the wine samples is minuscule. Where alcohol can be measured in full percents by volume, salt is better described in parts per million. It is interesting that slight variation in a small amount can have such a great effect on quality.

We also start to see a divergence between the two main varietals of wine. Red wines tend to have almost twice the amount of salt of white wines. This suggests that quality is reliant on a combination of compounds.

Alcohol, Salt and Sugar Content

Composition of Varietal Quality

We now start to see the relationship between combinations of compounds and quality. Also apparent are the differences and similarities between red and white wines. There’s a convergence between both varietals for ratios of salt and sugar where the highest alcohol content occurs. This sweet spot for salt and sugar is represented in the darkened area of the graph.

There’s a range of salt that is acceptable for reds and whites. There’s a similar range for sugar as well. The difference for varietals is which solute is more forgiving in quality. More salt is acceptable in red wines as long as it remains within a limitation of sugar. The inverse is true for white wines.

Reflection

One thing that strikes me about this data is how limited the quality score is. Asking someone to rate something on a scale of one to whatever does offer some insight. The question here is, how many opinions using that scale are needed to get a usable granularity? It’s a great starting point with limited use.

Despite the limitations of quality score, the dataset does offer a depth of possibilities. Differences of composition between varietals absent of quality considerations is the top of my list. While alcohol content was chosen as a surrogate to quality, it still is a measure of alcohol content. It could be that alcohol is a good measure of quality. I would argue that the polled experts have a bias for alcohol content. What makes them experts and what criteria do they use in rating decisions?

There are also many items noted in the initial pass of the data that were not explored further. One group of compounds not explored and deserving a future analysis is the acids. There’s enough evidence to explore the relationship between citric acid, sugar and salt. There’s also an expectation that pH is influenced by acidic compounds. Expectations should always be tested.

At the end of it, the gleaned insight is something I look forward to testing. Next time I’m in the store for a bottle of wine, I’m going to use alcohol content as my primary deciding factor. It will be interesting to see how many bottles I go through before it proves to be a bad metric.